A Chinese Dataset with Negative Full Forms for General Abbreviation Prediction
Abstract
Abbreviation is a common phenomenon across languages, especially in Chinese. In most cases, if an expression can be abbreviated, its abbreviation is used more often than its fully expanded form, since people tend to convey information as concisely as possible. For various language processing tasks, abbreviations are an obstacle to improving performance, as the textual form of an abbreviation does not express useful information unless it is expanded to the full form. Abbreviation prediction means associating fully expanded forms with their abbreviations. However, owing to the scarcity of abbreviation corpora, current studies of this task are limited, especially considering that general abbreviation prediction should also cover full form expressions that have no valid abbreviation, namely the negative full forms (NFFs). Corpora incorporating negative full forms for general abbreviation prediction are few in number. To promote research in this area, we build a dataset for general Chinese abbreviation prediction, which needs a few preprocessing steps, and evaluate several different models on the built dataset. The dataset is available at https://github.com/lancopku/Chinese-abbreviation-dataset.
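As a rough illustration of the task described in the abstract, the sketch below shows one possible way to represent positive and negative full forms and to score predictions by exact match. It is not the paper's method or the dataset's actual file format: the keep/skip labeling, the use of `None` to mark a negative full form, and the function names `keep_skip_tags` and `full_match_accuracy` are all assumptions made for illustration. The pair 北京大学 → 北大 is a well-known abbreviation; the second example is treated as a negative full form purely for demonstration.

```python
from typing import List, Optional, Sequence


def keep_skip_tags(full_form: str, abbreviation: str) -> List[str]:
    """Label each character of the full form as kept ("K") or skipped ("S").

    Assumes the abbreviation is a subsequence of the full form and aligns
    greedily from left to right (an illustrative simplification).
    """
    tags = []
    j = 0  # position of the next abbreviation character to match
    for ch in full_form:
        if j < len(abbreviation) and ch == abbreviation[j]:
            tags.append("K")
            j += 1
        else:
            tags.append("S")
    if j != len(abbreviation):
        raise ValueError("abbreviation is not a subsequence of the full form")
    return tags


def full_match_accuracy(
    gold: Sequence[Optional[str]], predicted: Sequence[Optional[str]]
) -> float:
    """Fraction of items whose prediction equals the gold answer exactly.

    None stands for "no valid abbreviation" (a negative full form), so a
    prediction of None is correct only when the gold answer is also None.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical toy data: one positive full form and one negative full form.
full_forms = ["北京大学", "今天天气"]
gold = ["北大", None]

print(keep_skip_tags(full_forms[0], gold[0]))
print(full_match_accuracy(gold, ["北大", None]))
print(full_match_accuracy(gold, ["北大", "今天"]))
```

Under this framing, a model that always outputs some abbreviation is necessarily wrong on every negative full form, which is why a corpus that includes NFFs changes how systems are evaluated.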
Similar resources
Generalized Abbreviation Prediction with Negative Full Forms and Its Application on Improving Chinese Web Search
In Chinese abbreviation prediction, prior studies are limited to positive full forms. This lab assumption is problematic in real-world applications, which have a large portion of negative full forms (NFFs). We propose solutions to this problem of generalized abbreviation prediction. Experiments show that the proposed unified method outperforms baselines, with the full-match accuracy of 79....
Mining atomic Chinese abbreviations with a probabilistic single character recovery model
An HMM-based single character recovery (SCR) model is proposed in this paper to extract a large set of atomic abbreviations and their full forms from a text corpus. An "atomic abbreviation" refers to an abbreviated word consisting of a single Chinese character. This task is important since Chinese abbreviations cannot be enumerated exhaustively but the abbreviation process for compound...
Coarse-grained Candidate Generation and Fine-grained Re-ranking for Chinese Abbreviation Prediction
Correctly predicting abbreviations given the full forms is important in many natural language processing systems. In this paper we propose a two-stage method to find the corresponding abbreviation given its full form. We first use the contextual information given a large corpus to get abbreviation candidates for each full form and get a coarse-grained ranking through graph random walk. This coa...
Constructing Chinese Abbreviation Dictionary: A Stacked Approach
Abbreviation is a common linguistic phenomenon with wide popularity and a high rate of growth. Correctly linking full forms to their abbreviations will be helpful in many applications. For example, it can improve the recall of information retrieval systems. An intuitive solution is to build an abbreviation dictionary in advance. This paper investigates an automatic abbreviation generation me...
Predicting Chinese Abbreviations with Minimum Semantic Unit and Global Constraints
We propose a new Chinese abbreviation prediction method which can incorporate rich local information while generating the abbreviation globally. Unlike previous character tagging methods, we introduce the minimum semantic unit, which is more fine-grained than a character but more coarse-grained than a word, to capture word level information in the sequence labeling framework. To solve the “ch...
Journal title:
- CoRR
Volume: abs/1712.06289
Pages: -
Publication date: 2017